An Ensemble Approach for Record Matching in Data Linkage

Authors

  • Simon K. Poon
  • Josiah Poon
  • Mary K. Lam
  • Qinglan Yin
  • Daniel Man-yuen Sze
  • Justin C. Y. Wu
  • Vincent C. T. Mok
  • Jessica Y. L. Ching
  • Kam-Leung Chan
  • William H. N. Cheung
  • Alexander Y. Lau
Abstract

OBJECTIVES To develop and test an optimal ensemble configuration of two complementary probabilistic data matching techniques, namely Fellegi-Sunter (FS) and Jaro-Winkler (JW), with the goal of improving record matching accuracy. METHODS Experiments and comparative analyses were carried out to compare the matching performance of ensemble configurations combining FS and JW against that of each technique applied independently. RESULTS Our results show that an improvement can be achieved when the FS technique is applied to the remaining unsure and unmatched records after the JW technique has been applied. DISCUSSION Whilst all data matching techniques rely on the quality of a diverse set of demographic data, the FS technique focuses on aggregating matching accuracy from a number of useful variables, whereas JW looks more closely at the data content (spelling in this case) of each field. Hence, these two techniques are shown to be complementary. In addition, the sequence in which these two techniques are applied is critical. CONCLUSION We have demonstrated a useful ensemble approach that has the potential to improve data matching accuracy, particularly when the number of demographic variables is limited. This ensemble technique is particularly useful when there are multiple acceptable spellings in the fields, such as names and addresses.
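
The sequence described in the abstract (JW first, then FS on the pairs JW did not accept) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the field names, the m/u probabilities, the JW acceptance threshold of 0.92, and the FS thresholds are all hypothetical, and stage 1 compares name fields only as a simplification.

```python
from math import log2

# Hypothetical (m, u) per field: m = P(field agrees | same person),
# u = P(field agrees | different people). Illustrative values only.
M_U = {"surname": (0.95, 0.01), "given": (0.90, 0.05), "dob": (0.98, 0.003)}


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    used1, used2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if used1[i]:
            while not used2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def fs_weight(a, b):
    """Fellegi-Sunter composite weight from exact field agreement."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            total += log2(m / u)              # agreement weight
        else:
            total += log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(a, b, accept=0.92, upper=6.0, lower=0.0):
    """Stage 1: JW on name fields. Stage 2: FS on pairs JW did not accept."""
    if all(jaro_winkler(a[f], b[f]) >= accept for f in ("surname", "given")):
        return "match"
    weight = fs_weight(a, b)
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "unsure"
```

With these assumed settings, a pair whose surnames differ by a transposition is accepted by the JW stage, while a pair with a larger spelling difference falls through to FS and is decided by the aggregate weight over all fields. In practice the thresholds and m/u values would need to be estimated from the data at hand.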

Similar resources

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

Cross Training and Under Sampling in Categorization of Company Announcements

Poon, S., Poon, J., Lam, M., Lin, Q., Sze, D., Wu, J., Mok, V., Ching, J., Chan, K., Cheung, W., et al. (2016). An Ensemble Approach for Record Matching in Data Linkage. In Andrew Georgiou, Louise K. Schaper and Sue Whetton (Eds.), Digital Health Innovation for Consumers, Clinicians, Connectivity and Community: Selected Papers from the 24th Australian National Health Informatics Conference (HIC ...

Probabilistic Linkage of Persian Record with Missing Data

Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in detecting duplicates within a data set. The i...

Distance Dependent Localization Approach in Oil Reservoir History Matching: A Comparative Study

To perform any economic management of a petroleum reservoir in real time, a predictable and/or updateable model of the reservoir, along with the ability to estimate uncertainty, is required. One relatively recent method is a sequential Monte Carlo implementation of the Kalman filter: the Ensemble Kalman Filter (EnKF). The EnKF not only estimates uncertain parameters but also provides a recursive estimat...

Minimally-Supervised Attribute Fusion for Data Lakes

Aggregate analysis, such as comparing country-wise sales versus global market share across product categories, is often complicated by the unavailability of common join attributes, e.g., category, across diverse datasets from different geographies or retail chains, even after disparate data is technically ingested into a common data lake. Sometimes this is a missing data issue, while in other c...

Journal title:
  • Studies in health technology and informatics

Volume 227  Issue

Pages  -

Publication date 2016